skip to main content


Search for: All records

Creators/Authors contains: "Shahabi, Cyrus"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Machine learning (ML) is playing an increasing role in decision-making tasks that directly affect individuals, e.g., loan approvals, or job applicant screening. Significant concerns arise that, without special provisions, individuals from under-privileged backgrounds may not get equitable access to services and opportunities. Existing research studies {\em fairness} with respect to protected attributes such as gender, race or income, but the impact of location data on fairness has been largely overlooked. With the widespread adoption of mobile apps, geospatial attributes are increasingly used in ML, and their potential to introduce unfair bias is significant, given their high correlation with protected attributes. We propose techniques to mitigate location bias in machine learning. Specifically, we consider the issue of miscalibration when dealing with geospatial attributes. We focus on {\em spatial group fairness} and we propose a spatial indexing algorithm that accounts for fairness. Our KD-tree inspired approach significantly improves fairness while maintaining high learning accuracy, as shown by extensive experimental results on real data. 
    more » « less
    Free, publicly-accessible full text available March 25, 2025
  2. Abstract

    Location-based alerts have gained increasing popularity in recent years, whether in the context of healthcare (e.g., COVID-19 contact tracing), marketing (e.g., location-based advertising), or public safety. However, serious privacy concerns arise when location data are used in clear in the process. Several solutions employ searchable encryption (SE) to achievesecurealerts directly on encrypted locations. While doing so preserves privacy, the performance overhead incurred is high. We focus on a prominent SE technique in the public-key setting–hidden vector encryption, and propose a graph embedding technique to encode location data in a way that significantly boosts the performance of processing on ciphertexts. We show that the optimal encoding is NP-hard, and we provide three heuristics that obtain significant performance gains: gray optimizer, multi-seed gray optimizer and scaled gray optimizer. Furthermore, we investigate the more challenging case of dynamic alert zones, where the area of interest changes over time. Our extensive experimental evaluation shows that our solutions can significantly improve computational overhead compared to existing baselines.

     
    more » « less
  3. Range aggregate queries (RAQs) are an integral part of many real-world applications, where, often, fast and approximate answers for the queries are desired. Recent work has studied answering RAQs using machine learning (ML) models, where a model of the data is learned to answer the queries. However, there is no theoretical understanding of why and when the ML based approaches perform well. Furthermore, since the ML approaches model the data, they fail to capitalize on any query specific information to improve performance in practice. In this paper, we focus on modeling "queries" rather than data and train neural networks to learn the query answers. This change of focus allows us to theoretically study our ML approach to provide a distribution and query dependent error bound for neural networks when answering RAQs. We confirm our theoretical results by developing NeuroSketch, a neural network framework to answer RAQs in practice. Extensive experimental study on real-world, TPC-benchmark and synthetic datasets show that NeuroSketch answers RAQs multiple orders of magnitude faster than state-of-the-art and with better accuracy. 
    more » « less
    Free, publicly-accessible full text available May 26, 2024
  4. Free, publicly-accessible full text available July 1, 2024
  5. Several "data-for-good" projects [1, 5, 12] initiated by major companies (e.g., Meta, Google) release to the public spatio-temporal datasets to benefit COVID-19 spread modeling [17, 47, 64] and understand human mobility [14, 24]. Most often, spatio-temporal data are provided in the form of snapshot high resolution population density information, where the released statistics capture population counts in small areas for short time periods. Since high resolution is required for utility (e.g., in modeling COVID hotspots) privacy risks are elevated. To prevent malicious actors from using the data to infer sensitive details about individuals, the released datasets must be first sanitized. Typically, [1, 5, 7, 12], differential privacy (DP) is employed as protection model, due to its formal protection guarantees that prevent an adversary to learn whether a particular individual's data has been included in the release or not. 
    more » « less
    Free, publicly-accessible full text available May 26, 2024
  6. Fairness in data-driven decision-making studies scenarios where individuals from certain population segments may be unfairly treated when being considered for loan or job applications, access to public resources, or other types of services. In location-based applications, decisions are based on individual whereabouts, which often correlate with sensitive attributes such as race, income, and education. While fairness has received significant attention recently, e.g., in machine learning, there is little focus on achieving fairness when dealing with location data. Due to their characteristics and specific type of processing algorithms, location data pose important fairness challenges. We introduce the concept of spatial data fairness to address the specific challenges of location data and spatial queries. We devise a novel building block to achieve fairness in the form of fair polynomials. Next, we propose two mechanisms based on fair polynomials that achieve individual spatial fairness, corresponding to two common location-based decision-making types: distance-based and zone-based. Extensive experimental results on real data show that the proposed mechanisms achieve spatial fairness without sacrificing utility. 
    more » « less
  7. Mobile apps that use location data are pervasive, spanning domains such as transportation, urban planning and healthcare. Important use cases for location data rely on statistical queries, e.g., identifying hotspots where users work and travel. Such queries can be answered efficiently by building histograms. However, precise histograms can expose sensitive details about individual users. Differential privacy (DP) is a mature and widely-adopted protection model, but most approaches for DP-compliant histograms work in a data-independent fashion, leading to poor accuracy. The few proposed data-dependent techniques attempt to adjust histogram partitions based on dataset characteristics, but they do not perform well due to the addition of noise required to achieve DP. In addition, they use ad-hoc criteria to decide the depth of the partitioning. We identifydensity homogeneityas a main factor driving the accuracy of DP-compliant histograms, and we build a data structure that splits the space such that data density is homogeneous within each resulting partition. We propose a self-tuning approach to decide the depth of the partitioning structure that optimizes the use of privacy budget. Furthermore, we provide an optimization that scales the proposed split approach to large datasets while maintaining accuracy. We show through extensive experiments on large-scale real-world data that the proposed approach achieves superior accuracy compared to existing approaches.

     
    more » « less
  8. As countries look toward re-opening of economic activities amidst the ongoing COVID-19 pandemic, ensuring public health has been challenging. While contact tracing only aims to track past activities of infected users, one path to safe reopening is to develop reliable spatiotemporal risk scores to indicate the propensity of the disease. Existing works which aim at developing risk scores either rely on compartmental model-based reproduction numbers (which assume uniform population mixing) or develop coarse-grain spatial scores based on reproduction number (R0) and macro-level density-based mobility statistics. Instead, in this article, we develop a Hawkes process-based technique to assign relatively fine-grain spatial and temporal risk scores by leveraging high-resolution mobility data based on cell-phone originated location signals. While COVID-19 risk scores also depend on a number of factors specific to an individual, including demography and existing medical conditions, the primary mode of disease transmission is via physical proximity and contact. Therefore, we focus on developing risk scores based on location density and mobility behaviour. We demonstrate the efficacy of the developed risk scores via simulation based on real-world mobility data. Our results show that fine-grain spatiotemporal risk scores based on high-resolution mobility data can provide useful insights and facilitate safe re-opening. 
    more » « less
  9. The goal of the crime forecasting problem is to predict different types of crimes for each geographical region (like a neighborhood or censor tract) in the near future. Since nearby regions usually have similar socioeconomic characteristics which indicate similar crime patterns, recent state-of-the-art solutions constructed a distance-based region graph and utilized Graph Neural Network (GNN) techniques for crime forecasting, because the GNN techniques could effectively exploit the latent relationships between neighboring region nodes in the graph if the edges reveal high dependency or correlation. However, this distance-based pre-defined graph can not fully capture crime correlation between regions that are far from each other but share similar crime patterns. Hence, to make a more accurate crime prediction, the main challenge is to learn a better graph that reveals the dependencies between regions in crime occurrences and meanwhile captures the temporal patterns from historical crime records. To address these challenges, we propose an end-to-end graph convolutional recurrent network called HAGEN with several novel designs for crime prediction. Specifically, our framework could jointly capture the crime correlation between regions and the temporal crime dynamics by combining an adaptive region graph learning module with the Diffusion Convolution Gated Recurrent Unit (DCGRU). Based on the homophily assumption of GNN (i.e., graph convolution works better where neighboring nodes share the same label), we propose a homophily-aware constraint to regularize the optimization of the region graph so that neighboring region nodes on the learned graph share similar crime patterns, thus fitting the mechanism of diffusion convolution. Empirical experiments and comprehensive analysis on two real-world datasets showcase the effectiveness of HAGEN. 
    more » « less